Cluster identification is similar to grouping similar things together. Say, if you have a basket full of different fruits, you would group apples with apples, bananas with bananas, and so on. In spatial transcriptomics, we are doing the same thing, but with complex gene expression data. The objective is to group together cells (or spatial locations) that exhibit similar gene expression profiles. These ‘clusters’ can provide insight into different cell types, states, or functional regions within the tissue, ultimately enabling researchers to gain a detailed understanding of biological processes and diseases. Further we can do clustering based on the gene expression, based on the spatial location or both together. I will guide you through the different options of spatial clustering and how we can use this for data integration.
There are plenty of different algorithm to perform spatial clustering. I will first show the a algorithm called “BayesSpace” which helps to identify transcriptional niches by integration the spatial localisation and gene expression similarity. The clustering can by performed by the following function:
# This will take some time ... take a coffee !
# If you want to speed the process up, the clustering is saved in the object so you dont have to run it...
object <- SPATA2::runBayesSpaceClustering(object)
#Plot:
SPATA2::plotSurface(object, color_by="bayes_space", pt_alpha=0.6, display_image = T)+
SPATA2::ggpLayerAxesSI(object)+
ggpLayerThemeCoords(unit = "mm")+
ggtitle("SNN Cluster")+
scale_color_manual(values=RColorBrewer::brewer.pal(8, "Set3"))
First we will compare both algorithms to check which cluster are similar and which are different. Therfore we can use mosaike plots which incooperate the number of data points and the distribution at the same time.
data <- SPATA2::joinWithFeatures(object, features=c("seurat_clusters", "bayes_space"))
library(ggmosaic)
ggplot(data = data) +
geom_mosaic(aes(x = product(seurat_clusters, bayes_space), fill=seurat_clusters)) +
scale_fill_manual(values=RColorBrewer::brewer.pal(8, "Set3"))+
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(colour = "black", size=0.5),
axis.text.x = element_text(colour="black", size=10),
axis.text.y = element_text(colour="white", size=10))+
coord_fixed()+
ggtitle("Compare SNN and BayesSpace")
The subsequent analysis for cluster characterization will follow the previously explained procedures. Given our expanded understanding of the biological functions, our focus will now shift to gaining deeper insights into cellular distribution across the clusters. To accomplish this, we will employ the method of spot deconvolution.
Imagine you have a cake with various layers of different flavors. Now, your task is to find out what percentage of each flavor is in a slice of the cake. That’s similar to what we’re doing in spatial transcriptomics with a process called “spot deconvolution”. We’re trying to figure out what types of cells (the ‘flavors’) are present in each spot (the ‘slice’) of a tissue sample (the ‘cake’). To do this, we start with the algorithm RCTD which is one of the most robust algorithm on the market. RCTD, or Reference Component Tissue Decomposition, is a method we use to figure out the ‘recipe’ of our slice. Here’s how it works in simple terms:
Making a Recipe Book: First, we use a set of single cells (which we already know a lot about) to make a ‘recipe book’. This book has the average gene expression (the ‘taste’ of the cell) for each cell type.
Matching Flavors: For each spot, RCTD tries to match the observed gene expression (the ‘taste’ of the spot) to the cell type expressions in our recipe book. It’s like trying to guess the cake recipe by tasting the slice.
Considering All Possibilities: But there might be many different recipes that could give the same taste. So instead of just guessing one recipe, RCTD makes a list of all possible recipes that could explain the taste of our slice.
Sampling Recipes: RCTD then randomly picks from these possible recipes many, many times. This gives us a spread of possible recipes for each spot.
By considering many possible ‘recipes’ for each spot, RCTD gives us a good idea of the different types of cells that could be in each spot, and how sure we are about these guesses. This is very useful for understanding the complex mix of cells in our tissue samples.
First we need a dataset (reference dataset) with single cells which need to contain defined labels of the cells. Here we use a downsized dataset of the GBMap (comprehansive reference atlas for GBM). The single cell data are normaly stored as seurat objects (more info on seurat: https://satijalab.org/seurat/). The seurat reference object is also uploaded.
ref <- readRDS("~/path/Down_scaled_reference.RDS")
# Colors of the cell types
colors <- readRDS("~/path/colors_cell_deconv.RDS")
col_cells <- colors$colors
names(col_cells) <- colors$annotation_level_4
# Plot the single cell data
f <- DimPlot(ref, group.by = "annotation_level_4")
library(scattermore)
f$data$annotation_level_4 <- factor(f$data$annotation_level_4, levels = colors$annotation_level_4)
p <- ggplot(data=f$data)+
geom_scattermore(mapping=aes(UMAP_1,UMAP_2), pointsize = 8, color="black")+
geom_scattermore(mapping=aes(UMAP_1,UMAP_2), pointsize = 7, color="white")+
geom_scattermore(mapping=aes(UMAP_1,UMAP_2, color=annotation_level_4), pointsize=4)
p <- p+
scale_colour_manual(values=col_cells)+
ylab("UMAP2")+xlab("UMAP1")+
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.text = element_text(size = 5),
legend.title = element_text(colour="white", size=3),
panel.background = element_rect(colour = "black", size=0.5),
axis.text.x = element_text(colour="black"),
axis.text.y = element_text(colour="black"))+
coord_fixed()+
ggtitle("Reference dataset")+
guides(color = guide_legend(override.aes = list(size = 4)))+
theme(legend.key.size = unit(0.5, "cm"))
p
We will use the SPATAWrappers extension to run the RCTD deconvolution:
# First install the package:
options(timeout = 600000000) ### set this to avoid timeout error
devtools::install_github("dmcable/spacexr", build_vignettes = FALSE)
#This step also take some time, so relax... if you want to speed up, the output is saved in the object...
# Run the analysis
object <- SPATAwrappers::runRCTD(object, ref=ref, cell_type_var = "annotation_level_4")
The analysis will output the celltype probabilities in the fdata slot of the spata object. From there we can visualize the cell distributions in the different clusters. For example let’s plot the tumor cell types across clusters:
tumor_celltypes <-
colors %>%
filter(annotation_level_2 %in% c("Stem-like", "Differentiated-like")) %>%
pull(annotation_level_4)
data <- joinWith(object, features = c("bayes_space", tumor_celltypes))
#
data <-
data %>%
pivot_longer(cols = tumor_celltypes) %>%
dplyr::select(c(bayes_space,name,value)) %>%
group_by(bayes_space,name) %>%
summarise(value=mean(value))
a <- ggplot(data = data) +
geom_col(aes(x =bayes_space, y=value, fill=name)) +
scale_fill_manual(values=col_cells)+
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(colour = "black", size=0.5),
axis.text.x = element_text(colour="black", size=10),
axis.text.y = element_text(colour="black", size=10))+
coord_fixed(ratio = 10)+
ggtitle("Cell types in cluster")
T_cells <-
colors %>%
filter(annotation_level_2 %in% c("Lymphoid")) %>%
pull(annotation_level_4)
data <- joinWith(object, features = c("bayes_space", T_cells))
#
data <-
data %>%
pivot_longer(cols = T_cells) %>%
dplyr::select(c(bayes_space,name,value)) %>%
group_by(bayes_space,name) %>%
summarise(value=mean(value))
b <- ggplot(data = data) +
geom_col(aes(x =bayes_space, y=value, fill=name)) +
scale_fill_manual(values=col_cells)+
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(colour = "black", size=0.5),
axis.text.x = element_text(colour="black", size=10),
axis.text.y = element_text(colour="black", size=10))+
coord_fixed(ratio = 120)+
ylim(0,0.05)+
ggtitle("Cell types in cluster")
a
b
From this data, we can interpretate that in cluster 1 (hypoxia) the most abundant cell type are mesenchymal tumor cells and the T cell fraction appeared significantly lower. The results can also be presented in a surface plot:
col=colorRampPalette(c("#FFFFFF", colors[colors$annotation_level_4=="MES_like_hypoxia_MHC", ]$colors))
a <-
SPATA2::plotSurface(object, color_by="MES_like_hypoxia_MHC", pt_alpha=1, display_image = F)+
scale_colour_gradientn(colours = col(50),limit=c(0,0.5),
oob = scales::squish, na.value = "white", name="Enrichment")+
guides(color = guide_colourbar(barwidth = 0.3, barheight = 8, ticks =F, frame.colour="black"), label=F)+
SPATA2::ggpLayerAxesSI(object)+
SPATA2::ggpLayerTissueOutline(object)+
ggpLayerThemeCoords(unit = "mm")+
ggtitle("Mes-Hypoxia")
col=colorRampPalette(c("#FFFFFF", colors[colors$annotation_level_4=="NPC_like_neural", ]$colors))
b <-
SPATA2::plotSurface(object, color_by="NPC_like_neural", pt_alpha=1, display_image = F)+
scale_colour_gradientn(colours = col(50),limit=c(0,0.5),
oob = scales::squish, na.value = "white", name="Enrichment")+
guides(color = guide_colourbar(barwidth = 0.3, barheight = 8, ticks =F, frame.colour="black"), label=F)+
SPATA2::ggpLayerAxesSI(object)+
SPATA2::ggpLayerTissueOutline(object)+
ggpLayerThemeCoords(unit = "mm")+
ggtitle("NPC")
a+b